the BIUs, cross channel comparisons, and mesh transaction
assessments. Fault identification and mesh reconfiguration
for the mesh is performed such that no faulty unit remains
active in decision making after reconfiguration, and the
number of good units isolated during reconfiguration is
minimized.
Description

Field of the Invention 

The present invention generally relates to a fault-tolerant
computer system and, more particularly, to improved fault
identification and reconfiguration performed in a mesh array
computer architecture.

Background of the Invention 

Fault-tolerant computing is the art of building computing
systems that continue to operate satisfactorily in the
presence of faults (i.e., hardware or software failures).
For example, large commercial aircraft typically include
complex flight control computer systems. To ensure the
safety of the passengers in the event of a fault, the
avionics systems of such aircraft can include three or more
redundant computer systems. The systems process the same
data from a common source or redundant data from different
sources monitoring the same parameter. Should a fault cause
an error in the output of one of the redundant computer
systems, the outputs of the other systems that agree are
used by the avionics system. The step of selecting data for
use in a process from among the outputs of redundant sources
is called "voting," which is a form of "fault masking"
because the erroneous component is ignored.
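
The voting step described above can be illustrated with a minimal sketch. It assumes a strict-majority rule over the outputs of redundant channels; the function names `vote` and `dissenters` are hypothetical and do not come from the specification.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of channels, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the channels to agree.
    return value if count > len(outputs) / 2 else None


def dissenters(outputs: Sequence[Hashable]) -> List[int]:
    """Indices of channels whose output differs from the voted value."""
    winner = vote(outputs)
    if winner is None:
        return []
    return [i for i, v in enumerate(outputs) if v != winner]
```

With three channels reporting 7, 7 and 9, `vote` selects 7 and `dissenters` identifies the third channel, which is the channel a fault-masking scheme would ignore.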

Although noise or other transient occurrences can produce
one-time disagreements that do not have long term adverse
impact on the overall avionics system, continuing
disagreements usually indicate a failed component or a break
in communications on one of the channels. Accordingly,
fault-tolerant systems can be configured to lock out one of
the redundant channels if it continues to produce results
that differ from the other channels. Reconfiguring a
fault-tolerant system in this manner is a form of "dynamic
recovery." In another dynamic recovery procedure, faulty
modules are identified and switched out of the system and
are substituted with a system spare. Thereafter, the system
instigates rollback, initialization, retry, or restart
actions necessary to continue ongoing computations.

One effective fault-tolerant "uni-processor architecture"
(multiple processors acting as one processor) that uses
fault masking and dynamic recovery is the mesh
interconnected array. The mesh interconnected array is a
matrix of nodes connecting identical central processing
units (CPUs) to identical memory modules. This approach can
be represented as multiple identical CPUs horizontally
disposed with a CPU bus extending vertically below each CPU.
Multiple identical memory modules are vertically disposed
with memory buses extending horizontally to intersect the
CPU buses at nodes or bus interface units (BIUs). Existing
mesh architectures perform fault masking and dynamic
recovery by comparing pairs of BIUs. One BIU acts as a
voting master and another as a voting checker. The checker
is compared with the master and does not actively pass data.
This is more commonly known as master-checker pairs and is
an effective masking technique in fault-tolerant systems.
However, the master-checker relationship relies on strict
majority vote rules and ignores the uniqueness of the mesh
architecture array. Also, these systems assume fault free
voting mechanisms and therefore fail to perform error
analysis within the BIU circuitry. Other examples of prior
art fault-tolerant architectures are shown in Budde et al.
U.S. patents 4,438,494; 4,503,534; and 4,503,535.

Summary of the Invention 

The present invention provides a reliable mesh
interconnected array in a fault-tolerant computer system.
The system includes multiple central processing units
(CPUs), multiple memory units, and bus interface units
(BIUs). The BIUs are located at the intersections of
multiple vertical CPU buses and multiple horizontal memory
buses. The result is a matrix of BIUs. Each BIU contains a
transaction controller for transmitting a transaction
including at least one of the following: address data, read
data, write data, or control data. A mesh bus connects all
BIUs within a single array of BIUs. The BIUs comprise an
error detector for detecting correctable, retryable and
non-retryable errors in the transmitted data of a
transaction. An error corrector in each BIU corrects any
correctable errors detected. A mesh error reporter in each
BIU asserts to all the BIUs via the mesh bus if a retryable
or non-retryable error was detected by the error detector. A
retry mechanism sends the data back to the error detector a
preset number of times if a retryable error remains asserted
on the mesh bus. A fault manager isolates and eliminates
BIUs with detected errors remaining after operation of the
retry mechanism by reconfiguring the active status of each
BIU connected on the mesh bus according to the type of
remaining detected error(s). A continuation mechanism
completes the transaction if no error(s) remain asserted on
the mesh bus.

In accordance with other aspects of this invention, the
error detector further comprises a single thread readback
mechanism for detecting errors of data written to memory
when only a single BIU or a single CPU channel is active.

In accordance with further aspects of this invention, the
fault manager within each BIU detects any self-implicating
errors, shuts down each BIU that detects an uncorrectable
self-implicating error, asserts a message to the mesh bus
indicating the functioning status (active or inactive) of
each BIU, and reconfigures the status of the BIUs by
performing a configuration consistency algorithm.

In accordance with yet other aspects of this invention, the
fault management processor within each BIU detects any
synchronization errors, asserts a first error message to the
mesh bus if a synchronization error was detected, performs a
first reconfiguration algorithm if a first error message was
asserted on the mesh bus and performs the configuration
consistency algorithm.

In accordance with still further aspects of this invention,
the fault management processor within each BIU detects any
memory bus errors, asserts a second error message to the
mesh bus if a memory bus error was detected, performs a
second reconfiguration algorithm if a second error message
was asserted on the mesh bus and performs the configuration
consistency algorithm.

In accordance with yet still other aspects of this
invention, the fault management processor within each BIU
detects any CPU bus errors, asserts a third error message to
the mesh bus if a CPU bus error was detected, performs a
third reconfiguration algorithm if a third error message was
asserted on the mesh bus and performs the configuration
consistency algorithm.

In accordance with other aspects of this invention, the
system includes as minimum architecture a CPU, at least one
memory unit, and a BIU. The system of the present invention
preferably includes multiple CPUs, multiple memory units,
and multiple BIUs for greater tolerance of faults. The
invention provides for fault identification and
reconfiguration in a mesh array architecture that allows for
single channel operation.

Brief Description of the Drawings 

The foregoing aspects and many of the attendant 
advantages of this invention will become more readily 
appreciated as the same becomes better understood by 
reference to the following detailed description, when 
taken in conjunction with the accompanying drawings, 
wherein: 

FIGURE 1 is a block diagram of a mesh interconnected array
in a fault-tolerant computer system in accordance with the
present invention using a 2 by 3 matrix of bus interface
units (BIUs);

FIGURE 2 is a positional assignment chart of the diagram of
FIGURE 1;

FIGURE 3 is a block diagram of a mesh interconnected array
in a fault-tolerant computer system in accordance with the
present invention using two 2 by 3 matrices of BIUs
performing off the same CPUs;

FIGURE 4 is a block diagram of the internal components of
the BIUs of the configuration of FIGURE 1 or FIGURE 3;

FIGURES 5-13 are flow diagrams illustrating steps performed
to achieve mesh reconfiguration in accordance with the
present invention;

FIGURES 14-19 are charts illustrating algorithms that
perform the reconfiguration;

FIGURE 20 is a diagram of the configuration consistency
check circuitry; and

FIGURES 21 and 22 are charts illustrating mastership
assignment performed in the circuitry of FIGURE 20.



Detailed Description of the Preferred Embodiment

FIGURE 1 is a block diagram of a plurality of bus interface
units (BIUs) 54 at nodes of a fault tolerant matrix (FTM)
array in accordance with a preferred embodiment of the
present invention. Multiple redundant central processing
units (CPUs) 56 have associated CPU buses 50 intersecting
memory buses 51. The memory buses extend from redundant
memory units 57. In the architecture of FIGURE 1, a 2 by 3
matrix consisting of two rows of memory units 57 and three
columns of CPUs 56 is used. Each BIU 54 provides a
connection or node between a CPU bus 50 and a memory bus 51,
i.e., a connection is provided for each column CPU 56 to
each row memory unit 57. Each BIU is identified first by the
row number of the connected memory unit 57 and then by the
column number of the connected CPU 56. All the BIUs 54 in a
single FTM are interconnected by a mesh bus 52. As shown in
FIGURE 2, the location of each BIU 54 has an associated
vector bit position which is part of a configuration vector
in reconfiguration processing, described in more detail
below with reference to FIGURES 19 and 20.

The present invention is capable of operating in various
configurations, such as a single CPU 56 connected through a
single BIU 54 to a single memory unit 57. The present
invention also operates effectively if a single CPU is used
with multiple redundant memory units, or a single memory
channel is used with multiple redundant CPUs. A single CPU
channel is identified by a single CPU 56 and CPU bus 50
connected to at least one memory bus 51 with a corresponding
memory unit 57. A single memory channel has a single memory
bus 51 and memory unit 57 connected to at least one CPU 56.
The system is also effective when disjoint BIUs 54 are
present. BIUs 54 are disjoint when two or more BIUs 54 fail
to occupy the same CPU bus 50 and memory bus 51. The present
invention is capable of handling matrix arrays larger than
that shown in FIGURE 1. However, for the purpose of this
description, a 2 by 3 matrix array is effective for showing
the redundant capabilities of the present invention. It is
also noted that larger matrix arrays provide diminishing
improvements in fault tolerant capabilities.

FIGURE 3 is a block diagram illustration of two FTM arrays
operating from the same set of CPUs 56, but connected to
distinct memory systems. A first 2 by 3 mesh, mesh A, has
two memory units 57 each consisting of an electrically
erasable programmable read-only memory (EEPROM) 57a, a
single port random access memory (SRAM) 57b and a dual port
random access memory (DPRAM) 57c. The memory rows of a
second 2 by 3 mesh, mesh B, comprise only a SRAM 57d and
DPRAM 57e. Mesh B performs distinct yet synchronous
operations from the same CPUs 56 that are coupled by mesh A.
The CPUs 56 can operate a greater variety of functions with
two FTMs connected. However, synchronization between the
FTMs is a concern during operation. It can be appreciated by
one of ordinary skill in the art that the mesh configuration
of the present invention can integrate with various types of
memory and input/output (I/O). System data requirements
dictate the memory and I/O requirements. One I/O system the
present invention can integrate with is an aircraft system
integrated modular avionics box, such as an ARINC 629 bus
attached to the DPRAMs.

The purpose of the present invention is to effectively
deliver or retrieve data from memory in the presence of
error(s). During a no error situation, all the BIUs have
assigned roles. Each row and column has a BIU assigned a
master role, and the rest of the BIU(s) of that row or
column are assigned checker roles. In a memory write, the
master BIU is the only BIU allowed to write data to memory
in each memory channel. Each CPU channel has a master BIU
for data transactions between the BIUs 54 and the CPUs 56.
The checker BIUs are used to check the master BIUs for
correctness. Only active BIUs participate in the checking or
voting performed in the algorithms of the present invention.
When the system is fully functioning, two important
interrelated tasks are performed. The first task detects
errors internal and external to every BIU in a FTM. The
other task is that of reconfiguration of the BIUs' status
according to the error detection task results.
Reconfiguration is performed by a series of algorithms,
described below with reference to FIGURES 14-21.

FIGURE 4 is a block diagram of the internal components of a
bus interface unit 54 with corresponding bus connections.
The bus interface unit 54 includes an address datapath 58, a
data datapath 59 and a transaction controller 63 connected
therebetween. An initialization controller 64 also is
connected between the address datapath 58 and data datapath
59. A fault identification and reconfiguration (FDIR)
datapath 62 interacts with the transaction controller 63 and
the initialization controller 64, and an input/output (I/O)
controller 61 interacts with all other components of the
BIU. Each BIU 54 includes a dual state machine. As shown in
FIGURE 4, each controller and circuit path 58, 59, 61-64 has
dual identical circuitry as illustrated by the shadowed
black boxes. The dual state machine includes the dual
identical circuits. The dual state machine provides
comparison checking of many of the processes performed
within the bus interface unit 54.

More specifically, as shown in FIGURE 4, the address
datapath 58 connects externally to CPU bus 50 (shown at the
left) and memory bus 51 (shown at the right). Internally,
the address datapath 58 is coupled to transaction controller
63, initialization controller 64, and the data datapath 59.
The address datapath 58 receives address data from the CPU
bus 50 via a CPU address bus 66 and communicates with the
memory bus 51 via a memory address bus 67. The data datapath
59 is internally coupled to transaction controller 63,
initialization controller 64 and address datapath 58
(represented by the bold line 70 at the center). The data
datapath 59 externally communicates with CPU bus 50 via a
CPU data bus 68 and memory bus 51 via a memory data bus 69.
Transaction controller 63 receives control data from CPU bus
50 via a CPU control bus 72 and is internally connected to
all of the BIU's internal components except the
initialization controller 64. Initialization controller 64
has no direct external connections and is internally
connected to all of the BIU's components except the
transaction controller 63. I/O controller 61 communicates
memory control data to and from memory bus 51 via a memory
control bus 71 and communicates with the other BIUs in the
FTM by the mesh bus 52. Internally, the I/O controller 61 is
also connected to the FDIR datapath 62.

Transaction controller 63 determines from CPU control data
the transaction that the BIU 54 is to perform and commands
each of the components to perform according to the
determined transaction. Some examples of transactions are
data reads from memory along bus 69, data writes to memory
along bus 69, and address accesses of memory along bus 67.
The address datapath 58 processes address transactions and
the data datapath 59 processes all read and write data
transactions. Transaction controller 63 also controls the
synchronization of all the components within the BIU 54,
thus ensuring efficient processing of all information
passing in and through the BIU 54.

Initialization controller 64 provides control of the BIU 54
during startup, including any instruction fetches required
during startup.

The FDIR datapath of each BIU 54 within an FTM stores a word
(configuration vector) that corresponds to a status value of
each BIU within the FTM. The precise storage location is
described below with reference to FIGURE 20. In a FTM with 6
BIUs the configuration vector is a six-bit word, each bit
representing the status of a BIU within the FTM. Referring
to FIGURE 2, each BIU 54 has a position in the FTM. The
position has a corresponding bit in the configuration
vector. For example, if the configuration vector was 011101,
the BIUs in vector bit position 0, bottom left of FIGURE 1,
and vector bit position 4, top center of FIGURE 1, are
considered non-operative because of the zeroes in those
respective positions. The considered non-operative BIUs are
restricted from FTM operations provided all the BIUs 54
agree with this configuration vector, see below with
reference to FIGURES 19-22. The FDIR datapath 62
reconfigures the configuration vector upon identification of
faults from the other components within the BIU 54, or
according to a change in status of other BIUs received from
the mesh bus 52. The reconfiguration algorithm is described
below with reference to FIGURES 19 and 20.
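
The configuration vector example above can be made concrete with a short sketch. It assumes the vector is written as a bit string whose leftmost character is vector bit position 0, which is the reading consistent with the 011101 example; the helper names `inactive_positions` and `mark_inactive` are hypothetical.

```python
from typing import List


def inactive_positions(config_vector: str) -> List[int]:
    """Return the vector bit positions whose BIU is considered non-operative.

    Assumption: position 0 is the leftmost character of the string.
    """
    return [pos for pos, bit in enumerate(config_vector) if bit == "0"]


def mark_inactive(config_vector: str, position: int) -> str:
    """Return a copy of the vector with the BIU at `position` marked inactive."""
    bits = list(config_vector)
    bits[position] = "0"
    return "".join(bits)
```

Under this assumption, `inactive_positions("011101")` yields positions 0 and 4, matching the two BIUs named in the example.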

Finally, I/O controller 61 provides an interfacing
controlling unit with memory bus 51, mesh bus 52 and the
internal components connected to the I/O controller 61.

Each BIU 54 in the FTM continuously detects errors of data
the BIU 54 receives, processes, and transmits. Two types of
detected errors are value errors and synchronization errors.
A value error is an error in the data received, processed or
transmitted by the BIU 54. A synchronization error arises
from inconsistencies in synchronization between the
components within a BIU 54, inconsistencies in
synchronization between the BIUs within an FTM, and
synchronization inconsistencies with a second FTM as shown
in FIGURE 3. It can be appreciated by one of ordinary skill
that any additional units interacting with a FTM must be
checked for synchronicity. Performance of dual state machine
comparisons, parity checks, complementary signal checks,
transaction consistency comparisons, wraparound comparisons
of signals sent off the BIU, clock phase comparisons and
mesh votes (described below with reference to FIGURES 19 and
20) provide continuous analysis for errors that may occur
anywhere in a transaction. Also, the analysis for errors
allows each BIU to perform self-implicating error
evaluations, thus providing a FTM that can operate with only
one active BIU.

The BIUs 54 are preprogrammed to characterize value and
synchronization errors in two ways. The errors are first
characterized as a BIU self-implicating error, a
synchronization consistency error, a memory bus error or a
CPU bus error. This identification is used by the fault
management procedure for further isolating the errors (see
FIGURE 11). The errors are also evaluated as either
correctable, retryable or non-retryable. The second
evaluation greatly affects how the fault management
procedure reacts to the discovered error (described in more
detail below with reference to FIGURES 5 and 20).
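
The two-way characterization above can be modeled as a pair of enumerations. This is only an illustrative sketch; the type and member names are hypothetical, and the rule in `must_report_on_mesh` reflects the statement elsewhere in this description that retryable and non-retryable errors are asserted on the mesh bus while correctable errors are corrected locally.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorSource(Enum):
    """First characterization: where the error points."""
    SELF_IMPLICATING = "BIU self-implicating error"
    SYNCHRONIZATION_CONSISTENCY = "synchronization consistency error"
    MEMORY_BUS = "memory bus error"
    CPU_BUS = "CPU bus error"


class Severity(Enum):
    """Second characterization: how the error can be handled."""
    CORRECTABLE = "correctable"
    RETRYABLE = "retryable"
    NON_RETRYABLE = "non-retryable"


@dataclass(frozen=True)
class DetectedError:
    source: ErrorSource
    severity: Severity


def must_report_on_mesh(error: DetectedError) -> bool:
    """Correctable errors are fixed locally; the others are reported."""
    return error.severity is not Severity.CORRECTABLE
```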

Transaction Data Flow

FIGURES 5-13 illustrate the fault-tolerant process of a
preferred embodiment of the present invention. All CPU
memory transactions are processed through the BIUs 54 and
are variations of reads from and writes to memory. As shown
in FIGURE 5, each transaction received by a BIU 54 is
analyzed for errors during phase operation, specifically an
address phase and a data phase (110). FIGURES 6 and 8,
described below, illustrate the error determining steps
performed in the address and the data phases. If the error
analysis determines that a retryable or non-retryable error
is present in either of the phases, a message indicating so
is asserted to the other BIUs via the mesh bus 52 (111)
(non-retryable errors are described below with reference to
FIGURE 11). However, if the error analysis determines that
the error present was correctable, the data is corrected
(112). If no error was discovered during the phase analysis
at 110, or if only correctable errors are present and are
corrected at 112, the system proceeds to a decision 115. If
at the decision 115 it is determined that no error is
asserted on the mesh bus 52, the system proceeds to
processing of the next transaction. If an error is asserted
on the mesh bus 52, the system proceeds to the fault
management procedure 116.
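
The flow of FIGURE 5 as described above can be sketched as a single decision function. This is a simplified model rather than the circuit behavior: severities are plain strings, the mesh bus is reduced to one boolean, and the name `process_transaction` and its parameters are hypothetical.

```python
from typing import Iterable


def process_transaction(local_severities: Iterable[str],
                        asserted_by_other_bius: bool = False) -> str:
    """Return the next step after phase analysis (110) of one transaction.

    local_severities: severities found by this BIU in the address and
    data phases, each "correctable", "retryable" or "non-retryable".
    asserted_by_other_bius: True if another BIU already asserted an
    error on the mesh bus.
    """
    mesh_error = asserted_by_other_bius
    for severity in local_severities:
        if severity in ("retryable", "non-retryable"):
            mesh_error = True  # step 111: assert the error on the mesh bus
        # "correctable" errors are corrected locally (step 112)
    # decision 115: is any error asserted on the mesh bus?
    return "fault_management_116" if mesh_error else "next_transaction"
```

Note that a BIU with no local errors still enters fault management when another BIU has asserted an error, because the decision at 115 looks at the shared mesh bus.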

FIGURE 6 illustrates the processing steps performed within
the address phase of operation. Initially, a case selection
144 is performed on the data received in the address phase.
Case selection 144 determines whether the data received is
from a new transaction or a retry, branch 144a, or from a
readback or a memory scrub, branch 144b. An address phase
retry is performed in a retry procedure described below with
reference to FIGURE 9, and the readback case is described
below with reference to FIGURE 13. The memory scrub case
performs correction of incorrectly stored information
determined correctable during a read transaction.

If the case is a new transaction or retry, branch 144a, the
CPU drives the address at 140 to the BIU, which latches the
address at 141. Then the address is decoded and a cache RAM
hit/miss determination is performed at 142. In this
embodiment, cache in the address datapath 58 provides faster
access to instructions/data stored in EEPROM. The address
phase then performs a synchronization procedure at 145 and a
nullification procedure at 147 (both procedures are
illustrated in FIGURE 7 and described below). Prior to the
nullification procedure, the BIU drives the address to
memory and memory controls are asserted at 146 onto the
memory control bus 71. The address phase analysis of the new
transaction or retry, branch 144a, is complete upon
completion of the nullification procedure, thus returning
any errors determined in the address phase procedure,
synchronization procedure or nullification procedure.

If the address phase procedure selects the case of a
readback or memory scrub, branch 144b, the address phase
procedure performs all of the same tasks as the retry case
except the BIU uses the previously held address (143) and
the procedure does not perform the nullification procedure.

FIGURE 7 illustrates the synchronization procedure at 145 of
FIGURE 6 and the nullification procedure at 147 that operate
within the address phase. The synchronization procedure
performs three simultaneous functions shown as paths 169a, b
and c. One function, 169a, determines if a synchronization
error occurs in the address phase of operation. The other
two simultaneous functions, paths 169b and c, drive a FTM
address flag if the BIU thinks it is being addressed by the
CPU and drive a mesh consistency signal if the BIU believes
its address phase operation is consistent with the memory in
memory units 57. The driven address flag is qualified with
any driven error signals at 175, then passed to the
algorithm beginning at 176a and 180a, described below with
reference to FIGURE 16A. The driven consistency signal is
similarly qualified with the error signals at 177 and
processed through an algorithm at 176b and 180b, described
below with reference to FIGURE 17A. In 181, resulting
actions are performed on the analysis of data from the
beginning of the synchronization process. If no flag signal
exists, no transaction will be prosecuted. If an address
flag and a transaction consistency signal exist, then the
synchronization process indicates that the transaction is
good. However, if a flag is present and no transaction
consistency signal is present, the transaction is illegal
and will generate a synchronous interrupt signal for further
use in the fault management procedure.
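
The resulting actions at 181 amount to a small decision table over the two voted signals. The sketch below assumes both signals have already been qualified and voted; the function name and the returned labels are hypothetical.

```python
def synchronization_outcome(address_flag: bool, consistency_signal: bool) -> str:
    """Map the voted address flag and consistency signal to an action."""
    if not address_flag:
        # No flag signal: no transaction is carried out.
        return "no_transaction"
    if consistency_signal:
        # Flag and consistency signal both present: the transaction is good.
        return "transaction_good"
    # Flag without consistency: illegal transaction, which generates a
    # synchronous interrupt for the fault management procedure.
    return "illegal_transaction_interrupt"
```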

The nullification procedure is illustrated with the
synchronization procedure, since it uses the same driven
error output of 170. The CPU 56 pipelines transactions
through BIUs 54 for efficient processing. However,
situations may occur when a transaction in the pipeline is
not necessary and must be nullified. The nullification
procedure provides the FTM with the capability of nullifying
an unwanted transaction. If a nullification signal is
received by a BIU, an output signal is sent to the mesh bus
52 to nullify the transaction at 190. The nullification
signal sent to the mesh bus 52 is then qualified by the
error signal output of 170 and the nullification algorithm
of 185 and 186 is executed. If the consensus of the
algorithm is a nullify decision, a non-retryable signal is
produced and any disagreeing BIU is shut down at 182
(non-retryable errors are described below with reference to
FIGURE 11). The nullification algorithm is described in more
detail with reference to FIGURE 18. Completion of both the
nullification and synchronization procedures returns the
processing to the address phase.

The other procedure performed in the error detection test at
110 of FIGURE 5 is the data phase procedure, shown in FIGURE
8. The data phase procedure includes case selection at 153
for choosing one of five transactions, shown as paths
167a-e, according to the data received by the data datapath
59: a read transaction 167a; a write transaction 167b; an
interrupt transaction 167c; a scrub transaction 167d; and a
readback transaction 167e. If the case is a read transaction
167a, a data read from memory, the BIU latches the data read
from memory at 150, checks it for errors in an error
detecting and correcting circuit (EDAC) located within
datapath 59 and generates parity at 152, and drives the data
onto the CPU bus 50 at 156. If the case determined is a
write transaction 167b, a data write to memory, the BIU
latches the data from the CPU to be written to memory at
151, generates error correcting code (ECC) at 155 and drives
the data received from the CPU onto the memory bus at 157.
The write transaction includes a step B that is described in
more detail below with reference to FIGURE 8B. If the
transaction is an interrupt 167c, interrupt parameters are
written to an interrupt register at 158. The interrupt
transaction initiates from errors requiring the FTM and/or
BIU to temporarily stop processing. If the case is a scrub
transaction 167d, error correcting code (ECC) is generated
at 162 and the BIU drives the data to be stored onto the
memory bus and asserts memory controls at 161. Finally, if
the case is a readback transaction 167e, the memory controls
are asserted at 163, the BIU latches the data sent to memory
at 166, checks the stored data by EDAC and compares that
data to the received data from the CPU at 165. The scrub
transaction performs corrections of correctable data.
Completion of each of the cases within the data phase
procedure returns the procedure back to the decision at 110
of FIGURE 5 with any errors determined by the data phase
procedure.

As shown in FIGURE 8B, a check is performed on the memory
stored data of a write transaction during either single CPU
channel or single memory channel writing to DPRAM at 127. If
the data written to memory is correct, the process returns
to phase processing 110 of FIGURE 5. If the data is
incorrect but correctable (131), the data is corrected
(132). However, if the data is incorrect but not
correctable, an error is asserted on the mesh bus.

If a retryable or non-retryable error(s) exist after
performance of the phase operations, the fault management
procedure 116 is initiated, as shown in FIGURES 9-12. At
decision 198, the process moves directly to a fault
identification and reconfiguration (FDIR) procedure 208 if a
non-retryable error was previously discovered. As shown in
FIGURE 9, the system separately retries the address and data
phase procedures at 197-205. At step 200, address phase
procedure errors are latched and saved until completion of
the data phase procedure. This allows the next transaction
to start the address phase procedure when the present
transaction starts the data phase procedure. A preset number
of retries are performed until no errors are asserted on the
mesh bus 52, as determined by a decision at 205, or the
retry limit is reached, as determined by decision 207. If
the retry purges the retryable error(s), the system proceeds
to processing the next transaction. If a fault is still
asserted on the mesh bus 52 after reaching the retry limit,
the system proceeds to the FDIR procedure 208.
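
The retry logic of FIGURE 9 described above can be summarized in a short sketch. It assumes a caller-supplied function that reruns the address and data phases once and reports whether an error is still asserted on the mesh bus; the names `fault_management`, `retry_once` and `retry_limit` are hypothetical.

```python
from typing import Callable


def fault_management(non_retryable: bool,
                     retry_once: Callable[[], bool],
                     retry_limit: int) -> str:
    """Return the next step after an error is asserted on the mesh bus.

    retry_once reruns the address and data phases and returns True while
    an error remains asserted on the mesh bus.
    """
    if non_retryable:
        # Decision 198: non-retryable errors skip the retries entirely.
        return "fdir_208"
    for _ in range(retry_limit):
        if not retry_once():
            # Decision 205: the retry purged the error.
            return "next_transaction"
    # Decision 207: retry limit reached with the fault still asserted.
    return "fdir_208"
```

A transient error that clears on the third attempt returns to normal processing when the limit is three or more, while a persistent error falls through to the FDIR procedure.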

Referring to FIGURE 10, the FDIR procedure 208 first checks
for address phase errors at 210-212, then data phase errors
at 215-217. If an error is detected in either of the phases
at 212 and 217, the fault identification (FID) procedure 218
is separately initiated. Essentially, at this stage of the
procedure any errors present are categorized into one of the
two phases, address or data. The FID procedure 218 further
isolates any errors within each phase and reconfigures the
mesh configuration vector according to any detected errors.

FIGURE 11 illustrates the FID procedure 218 that further
isolates any errors and reconfigures the configuration
vector according to the location of the error(s). Each BIU
maintains the n bit configuration vector that indicates the
BIU's perception of the status of each of the n BIUs in the
FTM. For the purpose of the present invention n=6 because
there are six BIUs in the 2 by 3 matrix array. The first
analysis determines if the error(s) are characterized as
self-implicating at 220. If one or more self-implicating
errors exist, an error message is asserted on the mesh bus
52 from a decision at 221. All BIUs continuously send a
phased heartbeat signal to the BIUs via the mesh bus 52.
During fault-free operation the heartbeat signal is
continually cycling at a particular cycling phase known to
all the BIUs. If the heartbeat signal of a BIU is received
out of phase, the BIU is killed or removed from voting. A
non-retryable error is an error that causes an erroneous
change in the BIU's heartbeat signal. The self-implicated
BIU shuts down at 223 and sends a change of its phased
heartbeat signal to all the BIUs in the FTM. A phase changed
heartbeat signal indicates that the configuration vector
must undergo reconfiguration. The individual BIU reaction to
a changed heartbeat signal is described below with reference
to FIGURE 20. A configuration consistency procedure 237
performs any necessary reconfiguration of the configuration
vector. The configuration consistency procedure performs a
configuration voting algorithm (CVA) and reassigns the
master BIUs accordingly, described below with reference to
FIGURES 19-22. Once complete, the system returns to the FDIR
procedure 208 (FIGURE 10) to continue checking for any other
errors of the same phase (address or data).

The three other error types are isolated through separate
processing passes in the FID procedure. Synchronization
consistency errors, memory bus errors, and CPU bus errors are
isolated and reconfigured according to predefined algorithms.
The synchronous consistency algorithms are described below
with reference to FIGURES ISA and 19A. The reconfiguration
algorithm for memory bus errors is described below with
reference to FIGURE 14A. The CPU bus error algorithm is
described below with reference to FIGURE 15A. After an error
is purged by the corresponding algorithm, the configuration
consistency procedure (FIGURE 12) performs any necessary
reconfiguration, followed by a return to the FDIR procedure
for further error processing.

As shown in FIGURE 12, each BIU receives, at 240, the
configuration vector from all the other BIUs. The CVA then
performs a comparison of each bit within the configuration
vectors at 241, 242 and 245. First, the BIUs in each memory
channel vote on a bit in the configuration vector at 241, as
described below with reference to step 1 of FIGURE 19A. Then
the results of the memory channels' votes are voted on for
each bit in the configuration vector at 242, and any BIU that
voted dead a BIU determined to be active is eliminated at 245,
as described below with reference to step 2 of FIGURE 19A.
Upon completion of the CVA, each BIU compiles a new consensus
configuration vector at 246 and compares this to a prestored
reference configuration vector at 247, as described below with
reference to FIGURES 19 and 20. If no match is found, the mesh
bus is asserted with an error and the configuration
consistency procedure begins again. If the vectors match, the
FID procedure is complete. The consensus vector is saved as
the new reference vector at 252, and a heartbeat error is
indicated if a BIU is faulty at 253. Mastership is assigned at
254 and 255 and the process is returned to the FDIR procedure.
The steps and circuitry of the configuration consistency
procedure are described in more detail below with reference to
FIGURES 19-22.

As noted, the address phase and data phase procedures are
performed prior to 110 of FIGURE 5 and during the fault
management procedure (at 197, 201, 210 and 215 of FIGURES 9
and 10).

Reconfiguration Algorithms

FIGURE 14A illustrates the two-step memory bus error algorithm
performed at 230 of FIGURE 11. In step 1, the memory channel
master BIU rechecks for any self-implicating errors, column A.
Elimination of the master (row 2, column B) occurs if the
master self-implicates. If the master does not self-implicate,
step 2 initiates (row 2, column A). Step 2 is a majority BIU
agree or disagree vote on the data the respective channel
master placed on the memory bus 51. If the majority of the
BIUs agree with the memory bus data (row 1, column A), the
transaction proceeds (row 1, column B), and any disagreeing
BIU is eliminated (row 2, column C). If the majority BIU
decision disagrees with the master memory bus data (row 2,
column A), the transaction is retried (row 2, column B), and
the minority BIU is eliminated (row 1, column C). In the case
of a tie (row 3, column C), the entire memory channel is
eliminated (row 3, column B). FIGURE 14B enumerates some
configurations and actions. In one example, column B of
FIGURE 14B, the memory channel master does not self-implicate.
The master and one other BIU in the same memory channel agree
(rows 1, 2), and the last BIU in the same memory channel
disagrees (row 3) with the data the master placed on the
memory bus 51. The result is that the transaction proceeds and
the disagreeing BIU is eliminated (row 4).
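
The two-step decision described above can be sketched in Python as follows. This is only an illustrative model, not the circuitry of the invention: the function name memory_bus_vote, the BIU labels, and the action strings are hypothetical, and the behavior of the transaction after a self-implicating master is eliminated is not modeled because the text does not state it.

```python
def memory_bus_vote(master, master_self_implicates, agrees):
    """Return (action, eliminated) for one memory channel.

    master: label of the channel master BIU (hypothetical naming).
    agrees: dict mapping each active BIU in the channel to True if it
            agrees with the data the master placed on the memory bus.
    """
    # Step 1: a self-implicating master is eliminated outright.
    if master_self_implicates:
        return "eliminate_master", {master}

    # Step 2: agree/disagree vote among the active BIUs of the channel.
    agreeing = {biu for biu, ok in agrees.items() if ok}
    disagreeing = set(agrees) - agreeing
    if len(agreeing) > len(disagreeing):
        return "proceed", disagreeing        # majority agrees: drop dissenters
    if len(disagreeing) > len(agreeing):
        return "retry", agreeing             # majority disagrees: drop minority
    return "eliminate_channel", set(agrees)  # tie: whole channel is removed


# The column B example: master and one other BIU agree, the last disagrees.
action, eliminated = memory_bus_vote(
    "biu0", False, {"biu0": True, "biu1": True, "biu2": False})
print(action, sorted(eliminated))  # proceed ['biu2']
```

With only two active BIUs in the channel, one agreeing and one disagreeing, the same function returns the tie outcome and removes the whole channel.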

FIGURE 15A illustrates the two-step CPU bus error algorithm
performed at 235 of FIGURE 11. Step 1 is similar to step 1 of
the memory bus error algorithm of FIGURE 14A except the CPU
channel's master BIU is the master BIU of interest. Step 2 is
a strict majority vote on the data the master CPU channel BIU
places on the CPU bus 50. Step 2 is similar to step 2
described in FIGURE 14A above.

FIGURE 15B enumerates some configurations and corresponding
actions. Column E is described below as an example. First, the
CPU channel master BIU determines that no self-implication
exists. The CPU channel master BIU agrees with what it placed
on the CPU bus 50, row 1. Only one other BIU is active in the
same CPU channel and this BIU disagrees, row 2, with the data
the master BIU placed on the CPU bus 50. With no majority
result, this CPU channel is eliminated, column 14, and no
further transaction processing is performed by this CPU
channel.

As shown in FIGURE 16A, faults indicated by errors relating to
FTM addressing 176a, 180a of FIGURE 7 are resolved by a
two-step majority vote of the BIUs 54 in the FTM. In step 1,
the BIUs 54 on each of the CPU channels vote on whether they
think the CPU 56 is addressing the FTM. In step 2, the results
from all of the CPU channel votes are voted. Ties in the first
step are neutral to the outcome of the second step. Specific
error information relating to the FTM address flag and the
mesh consistency is placed on the mesh bus 52 to nullify the
validity of the flag and mesh consistency of an otherwise
active BIU. The results of most cases agree with that of a
simple majority. As shown in FIGURE 16B, the fault syndrome
columns A, B, C represent the CPU columns, and each of the two
responses in a column represents a voting BIU on a separate
memory channel. In the example of row 4, a tie is experienced
by the addressing votes of the BIUs in the column A and B CPU
channels, and a consensus yes address vote is given by the two
BIUs in the right CPU channel, column C. The resulting
majority CPU channel vote is yes, column D. The FTM is being
addressed, the transaction is initiated, and the BIUs that
voted not addressed, the top left and bottom center BIUs, are
eliminated, columns E, F. The "Comments" columns for
FIGURES 16-19 are reserved for comments regarding the
corresponding row results.

Minority case exceptions are noted by "minority" and "tie
breaker" in the comment column. For the minority cases, the
response is more conservative than a simple majority vote. For
the tie breaker case, the FTM remains active because the vote
of CPU channel votes gives a majority to one of the possible
outcomes even though the absolute number of BIU votes on each
side of the question is the same. Note that several resultant
configurations are disjoint, such that there may be two single
channels with no cross channel data comparisons, although mesh
synchronization is still in place.
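
The two-step vote and the row 4 example can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the names tally and two_step_vote and the "top"/"bottom" position labels are hypothetical, and the "minority" exceptions listed in the figures, where the response is more conservative than the plain vote, are not modeled.

```python
def tally(votes):
    """Return True or False for a strict majority, or None for a tie.

    None entries (for example, channels whose own vote tied) are ignored,
    which keeps first-step ties neutral in the second step.
    """
    votes = list(votes)
    yes = sum(1 for v in votes if v is True)
    no = sum(1 for v in votes if v is False)
    if yes == no:
        return None
    return yes > no


def two_step_vote(channels):
    """channels: dict of CPU channel -> dict of BIU position -> bool vote."""
    # Step 1: each CPU channel votes among its own BIUs.
    channel_results = {name: tally(v.values()) for name, v in channels.items()}
    # Step 2: the channel results are voted; tied channels stay neutral.
    decision = tally(channel_results.values())
    # BIUs that voted against a decided outcome are listed as dissenters.
    dissenters = []
    if decision is not None:
        dissenters = [(name, pos) for name, v in channels.items()
                      for pos, vote in v.items() if vote != decision]
    return decision, dissenters


# Row 4 example: channels A and B tie, channel C votes yes unanimously.
row4 = {"A": {"top": False, "bottom": True},
        "B": {"top": True, "bottom": False},
        "C": {"top": True, "bottom": True}}
print(two_step_vote(row4))  # (True, [('A', 'top'), ('B', 'bottom')])
```

The same function reproduces the tie breaker situation: with one yes vote in each of two single-BIU channels and two no votes in the third channel, the BIU votes split two against two, yet the channel vote is two to one in favor of yes.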

The voting algorithm 176b, 180b of FIGURE 7 and 226 of
FIGURE 11 is shown in FIGURE 17A. In step 1, if the majority
CPU channel vote decides that the transaction is inconsistent,
row 2, then a memory exception is initiated in step 2, row 2.
A tie results in CPU channel shutdown, row 3, and a consistent
majority vote initiates the transaction, row 1, and eliminates
the minority voting BIUs, row 1. FIGURES 17B-D enumerate the
generic configurations and actions with an explanation of the
notation at the end of the table. The results of most cases
agree with that of a simple majority. The exceptions are noted
by "minority" and "tie breaker" in the Comment column. For the
minority results, the results are more conservative than a
simple majority vote. As with the vote procedure of
FIGURE 16A, disjoint configurations are possible as a result
of the vote. One example of the above algorithm is shown in
row 18 of FIGURE 17B. Five BIUs are able to vote, see columns
A-C. As in FIGURE 16B, each of columns A-C represents the BIUs
connected to the CPU bus and the fractioned row represents a
BIU able to vote in one of the two memory buses. The column A
vote result is a tie. The column B vote is for a consistent
transaction and the column C result is for an inconsistent
transaction. As a result of the votes above, the CPU channels'
combined vote is a tie because of the one tie, one consistent
and one inconsistent vote. The tie vote results in the FTM
shutting down as indicated by columns E-G.
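
The mapping from the combined CPU channel vote to an action can be sketched as below. The result codes C, M and S follow the notation of the figures (continue operation, invoke MEXC, shut down); the function name consistency_action is hypothetical, per-channel ties are assumed to be neutral as in the row 18 example, and the "minority" exceptions of FIGURES 17B-D are not modeled.

```python
def consistency_action(channel_votes):
    """Map per-channel results to a result code.

    channel_votes: one entry per CPU channel, True (consistent),
                   False (inconsistent) or None (that channel tied).
    Returns "C" (continue), "M" (invoke MEXC) or "S" (shut down).
    """
    votes = list(channel_votes)
    consistent = sum(1 for v in votes if v is True)
    inconsistent = sum(1 for v in votes if v is False)
    if consistent > inconsistent:
        return "C"  # consistent majority: the transaction is initiated
    if inconsistent > consistent:
        return "M"  # inconsistent majority: memory exception
    return "S"      # tie: shut down


# Row 18 example: column A ties, column B consistent, column C inconsistent.
print(consistency_action([None, True, False]))  # S
```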

As shown in FIGURES 18A-F, the nullification vote of 182, 185,
186 of FIGURE 7 differs from the previous two synchronization
consistency error algorithms in that the vote takes place
after the action has occurred. The decision to nullify or
continue, column B of step 2, must be made by each BIU without
the benefit of a synchronizing transaction on the mesh bus 52.
Therefore, in order to maintain synchronization, a post-action
vote is performed and the disagreeing BIUs immediately shut
down, column C, to avoid loss of synchronization, see step 2.
This must be done because the nullification/continuation
decision results in an irrevocable change of state in the
system. Nullification occurs for three reasons: the CPUs 56
assert a nullification signal; the other FTM asserts a hold
signal; or self-shutdown of a BIU. Only the first two cases
are significant because the third case is self-resolving. A
discrepancy in the nullification vote due to either
nullification or hold signal assertion implies either a data
error or the loss of synchronization across the CPU channels.

FIGURE 18A specifies the two-step fault identification process
for the nullification mesh vote. The nullify decision is
asserted onto the mesh bus 52 between counts of the clock. In
step 1 of FIGURE 18A, the BIUs for each CPU channel are voted,
generating votes for the CPU channels. In step 2, all of the
CPU channel votes from step 1 are voted. If a BIU 54
recognizes that its nullification decision did not match the
decision of the mesh vote, then that BIU shuts down, column C,
and broadcasts the shutdown by changing its heartbeat signal.
The change in the BIU's heartbeat signal is an immediate
indication to the FTM that a BIU has killed itself. Surviving
BIUs detect the heartbeat change and perform the configuration
consistency procedure to generate a new configuration vector.
Note that the result of the vote generates only an individual
error, not a new FTM configuration. FIGURES 18B-F enumerate
some generic configurations and corresponding actions with an
explanation of the notation at the end of the table. For one
example, see row 10 of FIGURE 18B. The left and center CPU
channels, columns A, B, experience a tie of the included BIUs.
The right CPU channel vote, column C, is to continue, not
nullifying the transaction. Thus, the combined vote of the CPU
channels is to continue, column D. The BIUs voting to nullify
the continued transaction above are eliminated, column E,
because these BIUs disagree with the result.

FIGURE 19A illustrates the CVA performed at 241, 242 and 245
of FIGURE 12. FIGURES 19B-E enumerate generic configurations
and actions with a notation explanation. In step 1, the CVA
performs a memory channel vote on each vector bit of the
configuration vectors generated by the BIUs 54. In step 2, the
memory channel votes are combined to form an intermediate
result. The final step, column C, is called "dead removal."
Dead removal refers to marking as dead any BIU that voted the
subject BIU dead in a vote in which the result had the subject
BIU remaining active, row 1, column C. If the combination
memory channel vote results in a majority dead decision,
row 2, or a tie, the BIU is eliminated, row 3.

The results of most cases agree with that of a simple
majority. The exceptions are noted by "minority" in the
comment column of FIGURES 19B-E. For those exceptions, the
present invention is more conservative than a simple majority
result and votes as inactive the unit in question. With only
two memory channels there are no "tie breaker" cases such as
those in the other mesh votes, in which the majority of
channel votes yields a decision in a case in which the
absolute number of BIU votes on each side of the question is
the same. Most cases are expected to be unanimous, in which
case the majority vote is clearly correct. The dead removal
procedure is invoked for a small fraction of cases to ensure
that whatever caused an inconsistency in configuration vectors
to appear across the FTM is removed. Row 10 of FIGURE 19B is
one example of a result from the CVA. This particular vote is
on the upper leftmost BIU in the FTM. The result of step 1 is
both memory channels voting the BIU dead. Thus, the resulting
combined memory channels' vote is dead, and the leftmost BIU
is therefore removed, columns D, E.
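
A compact Python sketch of the CVA, including dead removal, is given below for illustration. It rests on several assumptions that are not taken from the figures: BIUs are keyed by (memory channel, CPU channel) tuples, vector bit row*3+col describes BIU (row, col), a tied memory channel is treated as neutral in step 2, and only the vectors passed in take part in the vote. The names tally and cva are hypothetical.

```python
ROWS, COLS = 2, 3  # two memory channels by three CPU channels


def tally(votes):
    """Return True (active), False (dead) or None for a tie."""
    votes = list(votes)
    active = sum(1 for v in votes if v is True)
    dead = sum(1 for v in votes if v is False)
    if active == dead:
        return None
    return active > dead


def cva(vectors):
    """vectors: dict (row, col) -> list of 6 bools, True meaning active."""
    consensus = []
    removed = set()
    for subject in range(ROWS * COLS):
        # Step 1: each memory channel votes on this bit.
        channel_results = [
            tally(vec[subject] for (row, _), vec in vectors.items() if row == r)
            for r in range(ROWS)
        ]
        # Step 2: combine channel votes; only a clear "active" result survives.
        alive = tally(channel_results) is True
        consensus.append(alive)
        # Dead removal: voters that called a surviving BIU dead are marked dead.
        if alive:
            removed |= {voter for voter, vec in vectors.items()
                        if not vec[subject]}
    for row, col in removed:
        consensus[row * COLS + col] = False
    return consensus


vectors = {(r, c): [True] * 6 for r in range(ROWS) for c in range(COLS)}
vectors[(1, 2)][0] = False  # one BIU wrongly reports the upper left BIU dead
print(cva(vectors))  # [True, True, True, True, True, False]
```

In this model a BIU is kept only when the memory channel votes give a strict majority for "active", so one channel voting dead against one channel voting active removes the BIU even if four of the six individual votes were "active".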

FIGURE 20 illustrates the configuration consistency circuitry.
The configuration consistency circuitry includes a StateIn
register 92 that outputs data to a CVA circuit 93. The output
of CVA circuit 93 passes through a multiplexer (MUX) 98a to a
new state register 91 and through a second MUX 98b to an old
state register 90. Also, the CVA circuit output and an output
of the new state register 91 are received by a StateMismatch
comparator 94. Old state register 90 outputs directly to a
opti generator 95 and through an AND gate 97 to a columnmaster
generator 96. Changed heartbeat signals are also received by
AND gate 97 and the output of AND gate 97 is received by the
CVA circuit 93. The old state register 90 always stores the
current configuration vector. A configuration is implemented
by storing the configuration vector into the old state
register 90. A killmask marks either a BIU that has a pending,
non-retryable error, as signaled by a changed heartbeat
signal, or any BIU whose heartbeat is not phased properly.
Once the BIU performs the reconfiguration procedure and
generates a new configuration vector, it stores that
configuration vector in the new state register 91. The new
state register 91 performs a circular shift of the 6-bit
configuration vector, asserting it onto a StateOut line and
ending the shift with the vector aligned again within the new
state register 91. The StateOut line is passed to all BIUs 54
on the mesh bus 52 via a MUX 99a and reenters the circuit at
the StateIn register 92.
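
The circular shift can be illustrated in Python: rotating a 6-bit value six times emits every bit once and leaves the register holding its original contents. The function name shift_out and the most-significant-bit-first order are assumptions made for this sketch; the text does not state the bit order used on the StateOut line.

```python
def shift_out(register, width=6):
    """Rotate `register` left `width` times, collecting the emitted bits.

    Returns (bits, register); after a full rotation the register is
    aligned again, that is, equal to its starting value.
    """
    mask = (1 << width) - 1
    bits = []
    for _ in range(width):
        msb = (register >> (width - 1)) & 1      # bit driven onto the line
        bits.append(msb)
        register = ((register << 1) & mask) | msb  # wrap the bit around
    return bits, register


bits, after = shift_out(0b101100)
print(bits, after == 0b101100)  # [1, 0, 1, 1, 0, 0] True
```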

At this point, each BIU 54 receives at StateIn register 92 a
copy of the configuration vectors that each BIU 54 in the FTM
has generated and stored in the new state register 91. The
CVA, as defined previously in FIGURE 19A, is invoked and a
new, composite configuration vector is generated from the data
in the StateIn register 92 and any possible killed BIUs as
received through AND gate 97. If the composite configuration
vector matches the contents of the new state register 91, as
determined at StateMismatch comparator 94, the process is
complete and the new configuration is implemented. If,
however, an active BIU has a miscompare at comparator 94,
indicating that the composite configuration vector does not
match what is stored in its new state register 91, the
comparator asserts StateMismatch onto the syndrome line of the
mesh bus 52 via a MUX 99b. The new composite configuration
vector is clocked into state register 91 by MUX 98b, the
previous vector stored in the new state register 91 is pushed
to the mesh bus 52, and the procedure is repeated until no
miscompare exists at comparator 94. At that point,
convergence is achieved with

The construction of the circuitry of FIGURE 20 is driven by
the need to handle conditions that may occur during
initialization. As shown in FIGURE 20, for initialization of
the mesh of this invention, the new state register 91 is
loaded through MUX 98a with a prestored configuration vector
retrieved from the EEPROM, and the configuration consistency
procedure is invoked using that vector as an initial value. It
is possible that the configuration vector retrieved by the
BIUs of one memory channel does not match the vector retrieved
by the BIUs of other memory channels. Consider the case in
which an entire memory channel is shut down during operation.
The EEPROM from that memory channel does not have the updated
configuration vector stored into it because that channel can
no longer respond to transactions. It contains a configuration
vector that has some of the BIUs of the inactive channel
marked as still active. In the initialization procedure, all
BIUs 54 start as active until shown otherwise. Thus, the next
time that the system is initialized, the previously
deactivated memory channel is active until either individual
checks or the configuration consistency procedure show it to
be inactive. By weighting the mesh configuration vote to
require a clear majority vote of memory channels, the inactive
status of the bad memory channel is determined even if only
one of the memory channels contains the accurate vector
indicating that the bad channel should be inactive.
Conversely, for a BIU 54 to remain active, the final vote of
memory channels must be explicitly positive. A vote by either
memory channel to kill a BIU results in the deactivation of
that BIU. Therefore, consistency is maintained.

After successful determination of the configuration vector,
the assignment of memory channel and CPU channel masters is
determined as a function of the configuration vector. In order
to avoid bus contention problems, the mastership status is
determined from the old state register 90 contents, which
should be consistent. The algorithm for determining memory
channel mastership is shown in FIGURE 21. When the master
memory channel BIU is deactivated, mastership assignment
increments column position until an active BIU is reached.
Memory channel mastership is simple because memory channel
master assignments have priority over CPU channel master
assignments.
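
The memory channel mastership rule can be sketched in Python as follows. The default master positions (0,2) and (1,1) and the modulo 3 column increment are taken from the FIGURE 21 residue reproduced later in this document; the function name memory_channel_master and the representation of the configuration as a nested list of booleans are assumptions made for illustration.

```python
COLS = 3
DEFAULT_MEMORY_MASTERS = {0: 2, 1: 1}  # memory channel -> default master column


def memory_channel_master(channel, current_col, active):
    """Return the (channel, column) of the master, or None if none is active.

    active[channel][col] is True when that BIU is active. The current
    master keeps mastership while active; otherwise the column position
    is incremented modulo 3 until an active BIU is reached.
    """
    for step in range(COLS):
        col = (current_col + step) % COLS
        if active[channel][col]:
            return (channel, col)
    return None


active = [[True, True, False],   # BIU (0,2) has been deactivated
          [True, True, True]]
print(memory_channel_master(0, DEFAULT_MEMORY_MASTERS[0], active))  # (0, 0)
print(memory_channel_master(1, DEFAULT_MEMORY_MASTERS[1], active))  # (1, 1)
```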

The rules for determining CPU channel mastership are more
complicated because the CPU channel mastership may be passed
from a still-active BIU to its column pair in order to avoid
the situation in which a BIU is both memory channel and CPU
channel master. The algorithm for determining CPU channel
mastership is shown in FIGURE 22. Three possible cases exist
if both BIUs in a CPU channel are active. In case 1 neither
BIU is the memory channel master; therefore, the CPU channel
master retains mastership. Case 2 has the same result when
both BIUs are memory channel masters. In case 3 one BIU is a
memory channel master and CPU mastership assignment goes to
the non-memory channel master. When a single BIU is active in
the CPU channel, it is the master.
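
The three cases and the single-BIU rule can be expressed as a short Python sketch. The default CPU channel masters (1,0), (0,1) and (1,2) come from the FIGURE 22 residue reproduced later in this document; the function name cpu_channel_master, the (row, column) tuples, and the data layout are hypothetical.

```python
DEFAULT_CPU_MASTER_ROW = {0: 1, 1: 0, 2: 1}  # CPU channel -> default master row


def cpu_channel_master(col, active, memory_masters):
    """Return the (row, col) of the CPU channel master, or None.

    active[row][col] is True for an active BIU; memory_masters is the set
    of (row, col) positions currently holding memory channel mastership.
    """
    rows = [r for r in range(2) if active[r][col]]
    if not rows:
        return None
    if len(rows) == 1:
        return (rows[0], col)                # a single active BIU is the master
    is_mem_master = [(r, col) in memory_masters for r in rows]
    if is_mem_master[0] != is_mem_master[1]:
        # Case 3: the BIU that is not a memory channel master takes over.
        return (rows[is_mem_master.index(False)], col)
    # Cases 1 and 2: the default CPU channel master retains mastership.
    return (DEFAULT_CPU_MASTER_ROW[col], col)


all_active = [[True] * 3, [True] * 3]
masters = {(0, 2), (1, 1)}  # default memory channel masters
print([cpu_channel_master(c, all_active, masters) for c in range(3)])
# [(1, 0), (0, 1), (1, 2)]
```

With every BIU active and the default memory channel masters in place, the sketch returns exactly the default CPU channel masters, so no BIU holds both roles.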

The present invention provides improved fault identification
and reconfiguration performed in a mesh array computer
architecture because of a number of important factors
previously described. The system's ability to categorize
errors by type and location, and its ability to reconfigure
the mesh architecture according to the identified errors, are
two features that enable the improvements of this invention.

While the preferred embodiment of the invention 
has been illustrated and described, it will be appreciated 
that various changes can be made therein without 
departing from the spirit and scope of the invention. 


Claims 

1. A fault-tolerant computer system, including one or more
central processing units (CPUs) each connected to a CPU bus
and one or more memory units each connected to a memory bus,
said CPU bus(es) intersecting said memory bus(es), said system
further comprising:

bus interface units (BIUs) located at the intersections of
the CPU bus(es) and the memory bus(es);
a transaction means for transmitting a transaction
comprising at least one of address data, read data, write
data, and control data;
a mesh bus for interconnecting predetermined BIUs;
said BIUs further comprising:

a means for storing a configuration vector representing a
BIU's knowledge of the operating status of the BIUs
connected to the mesh bus;
an error detecting means for detecting correctable,
retryable and non-retryable errors in said data of a
transaction;
an error correcting means for correcting any detected
correctable errors;
an error asserting means for asserting to all the BIUs via
the mesh bus if a retryable or non-retryable error(s) was
detected by said error detecting means;
a retry means for sending the data back to said error
detecting means a preset number of times, if a retryable
error remains asserted on the mesh bus;
a fault management means for isolating and eliminating
said BIU(s) with any detected error(s) remaining after
said retry means by reconfiguring the configuration vector
of each BIU connected on the mesh bus according to the
type of remaining detected error(s) and generating a
consensus configuration vector according to the
reconfigured configuration vector of each BIU; and
a continuation means for completing the transaction if no
error(s) remain asserted on the mesh bus.

2. The fault-tolerant computer system of claim 1, wherein said
error detecting means further comprises:

a single thread read back means for detecting errors of
data written to memory when only a single CPU channel is
active.

3. The fault tolerant computer system of claim 1 or 2, wherein
said error detecting means further comprises:

a single thread read back means for detecting errors of
data written in memory when only disjoint BIUs are active.

4. A fault tolerant computer system according to claim 1, 2 or
3, wherein said fault management means further comprises:

a self-implicating error means within the fault management
means of each BIU for detecting self-implicating errors,
shutting down each BIU that detects a self-implicating
error, asserting the configuration vector of each BIU onto
the mesh bus, and generating a consensus configuration
vector.

5. A fault tolerant computer system according to any of claims
1-4, wherein said fault management means comprises:

a synchronization error means within the fault management
means of each BIU for detecting synchronization errors,
asserting a first error message to the mesh bus if a
synchronization error was detected, performing a first
reconfiguration algorithm if a first error message is
asserted on the mesh bus and generating a consensus
configuration vector according to the results of the first
reconfiguration algorithm.

6. A fault tolerant computer system according to any of claims
1-5, wherein said fault management means comprises:

memory bus error means within the fault management means
of each BIU for detecting memory bus errors, asserting a
second error message to the mesh bus if a memory bus error
was detected, performing a second reconfiguration
algorithm if a second error message is asserted on the
mesh bus and generating a consensus configuration vector
according to the results of the second reconfiguration
algorithm.

7. A fault tolerant computer system according to any of claims
1-6, wherein said fault management means comprises:

CPU bus error means within the fault management means of
each BIU for detecting CPU bus errors, asserting a third
error message to the mesh bus if a CPU bus error was
detected, performing a third reconfiguration algorithm if
a third error message is asserted on the mesh bus and
generating a consensus configuration vector according to
the results of the third reconfiguration algorithm.

8. A fault-tolerant computer system according to any of claims
1-7, wherein said fault management means transmits data
through the error detecting means in the following order:

the self-implicating error means;
the synchronization error means;
the memory bus error means; and
the CPU bus error means.

9. A fault-tolerant computing method, including one or more
central processing units (CPUs) each connected to a CPU bus
and one or more memory units each connected to a memory bus,
each CPU bus intersects each memory bus, said method
comprising the steps of:

transmitting a transaction comprising at least one of
address data, read data, write data, or control data;
connecting predetermined BIUs by a mesh bus;
detecting within said BIUs correctable, retryable and
non-retryable errors in said data of the transaction;
correcting any correctable errors detected;
asserting an error message to all the BIUs via the mesh
bus if a retryable or non-retryable error(s) was detected;
generating a configuration vector for each BIU, said
configuration vector representing the BIU's knowledge of
the operating status of all BIUs connected to the mesh
bus;
completing the transaction if no error messages are
asserted on the mesh bus;
sending the data back to said error detecting step a
preset number of times if a retryable error(s) remains
asserted on the mesh bus;
isolating and eliminating within said BIUs any remaining
retryable or non-retryable error(s) by reconfiguring the
configuration vector of each BIU connected on the mesh bus
according to the type of detected error(s) and generating
a consensus configuration vector according to the
reconfigured configuration vector; and
completing the transaction according to the reconfigured
configuration vector.

10. A fault-tolerant computing method according to any of
claims 1-9, further comprising the steps of:

storing the consensus configuration vector; and
regenerating a consensus configuration vector, if the
consensus configuration vector does not match a previously
stored consensus configuration vector.

11. A fault-tolerant computer system, including a central
processing unit (CPU) connected to a vertical CPU bus and at
least one memory unit connected to a horizontal memory bus,
said system further comprising:

a bus interface unit (BIU), wherein the BIU is located at
an intersection of the vertical CPU bus and the horizontal
memory bus;
a transaction means for transmitting a transaction
comprising at least one of address data, read data, write
data, or control data;
said BIU further comprising:

an error detecting means for detecting correctable and
uncorrectable errors in said transmitted data of a single
transaction, wherein said step of detecting further
comprises the steps of:
a read back means for reading back said data of the
transaction if the transaction is a write transaction and
detecting the read back data for correctable and
uncorrectable errors;
an error correcting means for correcting any correctable
errors detected;
a retry means for sending the data back to said error
detecting means a preset number of times if an
uncorrectable error(s) is detected;
a shutdown means for shutting down the BIU if an
uncorrectable error(s) remains after completion of the
retry means; and
a continuation means for completing the transaction if
said BIU is not shut down.

12. A fault-tolerant computing method, including a central
processing unit (CPU) connected to a vertical CPU bus and at
least one memory unit connected to a horizontal memory bus,
said buses intersect in a mesh format, said method comprising
the steps of:

connecting a bus interface unit (BIU) at the intersection
of the buses;
transmitting a transaction comprising at least one of
address data, read data, write data, or control data;
detecting within said BIU correctable and uncorrectable
errors in said transmitted data of a single transaction,
wherein said step of detecting further comprises the steps
of:

reading back said data of the completed transaction if the
transaction is a write transaction and detecting the read
back data for correctable and uncorrectable errors,
correcting the data if the check indicates a correctable
error is present;
correcting within said BIU any correctable errors
detected;
sending the data back to said error detecting step a
preset number of times if an uncorrectable error(s) is
detected;
shutting down the BIU if an uncorrectable error remains
after completion of the step of sending the data back; and
completing the transaction if said BIU is not shut down.
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NOTATION: 

A) FAULT SYNDROME: FLAG VALUES FOR CPU CHANNEL J BIUs

   Y - BIU THINKS TRANSACTION IS CONSISTENT
   N - BIU THINKS TRANSACTION IS INCONSISTENT

B) RESULTS:
   C - CONTINUE OPERATION
   M - INVOKE MEXC
   S - SHUT DOWN
   S(S:Y^N) - SHUT DOWN (THE RESULT WITH ALL Y's AND N's
              SWITCHED WOULD ALSO BE SHUT DOWN)

C) NEXT STATE: SJ - DESIRED NEXT STATE FOR CPU CHANNEL J BIUs

   1/1 - UNITS IN MEMORY CHANNELS 1 & 0 REMAIN ACTIVE
   1/0 - UNIT IN MEMORY CHANNEL 1 REMAINS ACTIVE,
         UNIT IN MEMORY CHANNEL 0 IS DEAD



EP 0 811 916 A2

STEP 1: CPU CHANNEL VOTE
MAJORITY DECISION: NULLIFY / CONTINUE / TIE
STEP 2
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NOTATION: 



A) FAULT SYNDROME: NULLJ - FLAG VALUES FOR CPU CHANNEL J BIUs
                   AS RECEIVED BY BIUi

   N - BIU NULLIFIED TRANSACTION
   C - BIU CONTINUED WITH TRANSACTION

B) RESULTS:
   C - CONTINUE TRANSACTION
   N - NULLIFY TRANSACTION
   S - SHUT DOWN
   S(S:N^C) - SHUT DOWN (THE RESULT WITH ALL N's AND C's
              SWITCHED WOULD ALSO BE SHUT DOWN)
   I - SCENARIO NOT POSSIBLE GIVEN NULLIFY FLAG VALUES

C) NEXT STATE:
   1 - BIUj REMAINS ACTIVE
   0 - BIUj SHUTS DOWN

NOTATION:

A) FAULT SYNDROME: CONJ - CONFIGURATION STATUS VALUE FROM
                   CPU CHANNEL J BIUs

   A - BIU THINKS UPPER LEFTMOST BIU IS ACTIVE
   D - BIU THINKS UPPER LEFTMOST BIU IS DEAD
   X - DON'T CARE

B) RESULTS:
   S - LEFTMOST BIU SURVIVES
   K - KILL LEFTMOST BIU

C) NEXT STATE: SJ - DESIRED NEXT STATE FOR CPU CHANNEL J BIUs

   1/1 - UNITS IN MEMORY CHANNELS 1 & 0 REMAIN ACTIVE
   1/0 - UNIT IN MEMORY CHANNEL 1 REMAINS ACTIVE,
         UNIT IN MEMORY CHANNEL 0 IS DEAD
   X - USED IN EXAMPLE IN WHICH TARGET BIU WAS KILLED BY DEAD
       REMOVAL. ACTUAL VALUE WOULD DEPEND UPON FAULT
       SYNDROME INPUT ACCORDING TO MESH VOTE RULES.

NOTE: BOLD TYPE ENTRIES IN NEXT STATE COLUMN WERE
REMOVED BY DEAD REMOVAL PROCEDURE

A. CASE                         B. ALGORITHM

DEFAULT MEMORY CHANNEL          (0,2) (1,1)
MASTERS                         A BIU WILL REMAIN MASTER AS LONG AS
                                IT REMAINS ACTIVE

WHEN MASTER BIU IS              INCREMENT COLUMN POSITION OF PREVIOUS
DEACTIVATED                     MASTER (MODULO 3) UNTIL AN ACTIVE BIU
                                IS REACHED
                                SEQUENCES:
                                (0,2) -> (0,0) -> (0,1)
                                (1,1) -> (1,2) -> (1,0)

A. CASE                         B. ALGORITHM

DEFAULT CPU CHANNEL             (1,0) (0,1) (1,2)
MASTERS                         A BIU MAY LOSE MASTERSHIP WHILE STILL
                                ACTIVE

BOTH BIUs IN CPU CHANNEL
ACTIVE

CASE 1)                         DEFAULT CPU CHANNEL MASTER RETAINS
NEITHER BIU IS MEMORY           MASTERSHIP
CHANNEL MASTER

CASE 2)                         DEFAULT CPU CHANNEL MASTER RETAINS
BOTH BIUs ARE MEMORY            MASTERSHIP
CHANNEL MASTER

CASE 3)                         NON MEMORY CHANNEL MASTER BECOMES
ONE BIU IS MEMORY               CPU CHANNEL MASTER
CHANNEL MASTER

ONE BIU IN CPU                  ACTIVE BIU IN CPU CHANNEL IS CPU
CHANNEL ACTIVE                  CHANNEL MASTER

(54) Mesh interconnected array in a fault-tolerant computer system



(57) Bus interface units (BIUs) (54) perform fault detection,
identification, and reconfiguration for all information
transfers between redundant central processing units (CPUs)
(56) and memory or input/output (I/O) (57A-C) in a mesh
interconnected array of a highly reliable fault-tolerant
computer system. Errors are detected by self-checking within
the BIUs, signal parity checks by the BIUs, cross channel
comparisons, and mesh transaction assessments. Fault
identification and mesh reconfiguration for the mesh is
performed such that no faulty unit remains active in decision
making after reconfiguration, and the number of good units
isolated during reconfiguration is minimized.
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